Introduction. The biological problem

  • In this class we will shift our attention to problems of variability in a more direct biological context. We will be discussing how sequence variability may affect gene regulation through the study of regulatory motifs, before we go on to see how genomic variability in general shapes phenotypic changes in individuals.

We will focus on two particular problems:

  • A. The study, analysis and interpretation of biological sequence motifs.
  • B. The concept of genomic variability and how it is directly related to phenotypical variation in a population.

Part A. Sequence Motifs

Motifs in general

What is a motif in general? How do we understand a motif?

Motifs in Biology

Motifs. A definition

We consider a motif a “repetitive substructure of a bigger object that is meant to convey a certain message”

  • Most (in not all) forms of messages employ “motifs” for:
    • repat/emphasis : e.g. “I have a dream”, MLK, “March on Jobs and Freedom Speech” “Crawth the raven, Nevermore”, Edgar Allan Poe
    • coherence: e.g. “Who controls the past, controls the future”, George Orwell “1984”
    • subtextualization: e.g. “Fair is foul and foul is fair”, William Shakespeare, “Macbeth” (and almost all of “Pulp Fiction”)
    • internal reference: e.g. all “leitmotivs” in Operas

Motifs in Biology

In biology we find a number of meaningful motifs that serve one (or more) of the above functions.

  • Genome: Codons, Transcription factor binding sites, CpG islands,
  • All areas of the genome that interact with proteins in sequence-dependent manner
  • Protein: Patterns of aminoacids that are related to particular function, modules, domains etc

Motifs in Biology

Probabilistic definition of a motif

We can think of such a probabilistic approach that takes into account the variability in each position. Starting with a number of sequence instances of equal size we may calculate the probabilities of occurrence of each nucleotide for each position. The resulting table of the probabilities is a called a Position Weigth Matrix, PWM.

This approach allows us to evaluate the variability of some positions against others and in addition to see which residues tend to be more important in which position.

Accounting for background nucleotide composition

Sequence variation does not only occur within the motif’s instances. Fluctuations in the compositions of DNA sequences occur throughout the genomes of all eukaryotes and specific regions may be enriched in some residues at the expence of others.

This is particularly evident in the fluctuation of GC% along eukaryotic chromosomes, which are divided in regions of high/low GC% that are called “isochores”.

The variability in the genomic sequences will undoubtedly interfere with the motif instances. Binding sites embedded in high GC% regions of the genome will be more GC-rich themselves. In order to account for this variability we need to normalize against what we call the background nucleotide composition.

We achieve this with a different kind of table called Position Specific Scoring Matrices, PSSM.

These are created by dividing the observed probabilities per position in the original PWM (P) with the expected values that are obtained for a given sequence (Q).

A log2-transformation step results in a table where high positive values are corresponding to preferred combinations of residues-positions. A PSSM table is always specific for a given sequence in which one can search for a motif, which brings us to the next problem…

Finding a motif in a sequence with a PWM/PSSM

PWM and PSSM tables are important not only for representations but also for the search and identification of motif instances in an unknown sequence.

Given a table (be it PWM or PSSM) and a longer sequence in which we are searching for the motif we can think of a strategy with which:

  1. We start at position 1 of the sequence
  2. Extract a subsequence \(S_1\) equal in length the the length of the PWM/PSSM
  3. Compare \(S_1\) with the PWM/PSSM by identifying the score of each nucleotide in each position
  4. Calculate a mean score of \(S_1\)
  5. Go to the next position (=2) of the sequence
  6. Repeat steps 1-5.

Strong vs Weak. Evaluating motifs.

Up to now we have seen how every motif can be described as a PWM/PSSM. A different question is related to how different PWM may be describing different patterns. The PWM for NF-kB may contain some very clear positional tendencies in the first and last positions, but this is not the case for the mid-positions. A motif for a different transcriptional regulator may have completely different preferences. We are interested in understanding how the underlying variability in each position and in the motif as a whole may be assessed and what it may tell us about the motif in general.

The question may be rephrased using the concept of “motif strength”. A “strong motif” is one that allows little variability. It means that it is very often found occurring in the same way. On the other hand a “weak” motif has a highly variable structure and thus may be found in a genome in many different variations.

Mathematics Interlude: Variation, Information and Entropy

  • In 1948 Claude Shannon’s pioneering work on message transmission introduce a fundamental concept and gave rise to a whole field of Science called “Information Theory”.

Claude Shannon

  • The basis of information theory is the concept of Entropy which is defined in the following:

    • Given the set \(S\) of \(n\) probable outcomes of a “source”, each of which has probability \(P[i]\)
    • The “Shannon” Entropy of this source is equal to the negative sum of the products of those probabilities and their logarithms, such as: \[H(S)=-\sum_{i=1}^{n} P[i]log(P[i])\]

Mathematics Interlude: Information as Entropy

  • It derives from Shannon’s formula that Entropy maximizes when all possible outcomes have equal probability.
  • This is directly related to the notion of Entropy as you know it from Physics. Can you see how?

When all outcomes have equal probability we are in a position of maximum uncertainty. In principle, out of all the possible things that can happen, we cannot dare make a guess in favour of one or another. We are thus in the worst possible state in terms of being informed about the system under study. Thus when all outcomes are equiprobable, uncertainty is maximum and information is minimum. This is the link between variablity, uncertainty and information that allows us to assess the strength of motifs.

Evaluating a motif with Information (I)

How can we then employ this concept in the case of DNA sequence motifs? Let’s say that you want to “guess” what a motif is. You will need to guess the residue occuppying each position. Considering there are four different outcomes (nucleotides) for each position, we can calculate the “before” Entropy, as being the maximum, with all outcomes equiprobable (P=0.25) to a total, maximum \(H=2\):

\[H(S)_{before}=-\sum_{i=1}^{4} P[0.25]*log(P[0.25])=2\]
This will be the same for every position in the motif.

What is then the entropy once the message has been transmitted? We will denote as “after” the entropy, which we can calculate directly from the PWM. Remember that the PWM contains the probabilities of each position:

\[H(S)_{after}=-\sum_{i=1}^{4} P[i]*log(P[i])=H\]

and thus:

\[I(S)=2-H(S)_{after}\]

The key is that the smaller the \(H(S)_{after}\) the more we have gained as information, since we are reducing the uncertaintly of the message.

Below you see how this is calculated for the NF-kB case:

Notice (red boxes) how some positions have high I, with low uncertainty, while others low I, when there is variability in the observed residues.

Plotting Information in Sequence Logos

Bioinfromaticians have come up with a nice way to represent the information content of motifs. In these, called Sequence Logos, each residue is shown as a coloured letter, whose height corresponds to its contribution to the overall I of the position (which is \(-P_ilog(P_i)\), with \(P_i\) being its probability in the given position). High bars correspond to high information content.

Motif representations (PWM/PSSM)

Going back to the motif search with PWM/PSSM, below you may see the logos obtained for the collection of the top 5% high scoring instances obtained with PSSM (top) and PWM (bottom) searches. Notice how the more “noisy” PWM-search gives rise to less “clear” motifs that eventually are more vague and score lower Information contents, in contrast to the much stronger PSSM-search motifs. The difference in I is considerable (13.7 against 8.6 out of a possible maximum of 20).

Part B. Genomic Variation

For the second part of our discussion of variability in sequence context, we will turn to the problem of assessing and interpreting genomic variation.

What is genomic variability?

With the term “Genomic Variation” (or variability) we refer to the qualitative and quantitative differences in the genomic sequences of members of a population.

Depending on the extent (single-nucleotide or more extended regions) we may be referring to single nucleotide variants (SNV) or copy number variants (CNV).

Depending on the frequency of the variant in the population we may distinguish between common and rare variants.

Depending on the effect a variant has on the observed phenotype we may be talking about polymorphisms (with little or no effect) and causal variants (with large, and in some cases devastating effects).

Why is it interesting

Variability is interesting in itself, but genomic variants that are associated to phenotypic properties, especially with pathological conditions, are of extreme importance in biomedicine. Below you see an example of a series of causal SNV in the human CFTR locus which is associated with Cystic Fibrosis. A single nucleotide variant called \(\Delta\)F508 is the most strongly linked variant with the occurrence of the disease.

Experimental approach

Large-scale studies of genomic variability are now possible through the application of Whole-Genome or Whole-Exome sequencing.

Which is then followed by a series of computational steps including:

  • the mapping of the sequences against the genome of reference (remember the Suffix Trees)
  • the definition of the variants (remember ANOVA)
  • the interpretation/functional study of the variants

Single nucleotide variants

They are identified as positions in the genome with increased variability. Although this is a simplification, you may consider this like a “sliding window” F-test that takes place on every position in the genome. Everytime a nucleotide position shows increased variability beyond a given threshold, it is flagged as a possible variable region. A variant may also be not variable as long as it is different from the reference genome sequence in the same position. (This is because the reference genome, already implies an assessment of variability since the nucleotide at this position is the one most commonly occurring in the population).

Below you see an example of no variability (first two panels), a homozygous variant (second panel, all reads are different than the reference) and a heterozygous variant (fourth panel, approximately half of the reads are variable compared to the genome reference).

#### Interpretation of variants

Once a variant is identified the next steps for its interpretation may include:

  • Filtering out known polymorphic positions
  • Identification of the genomic location of variants (genic, exonic, intronic etc)
  • Prediction of the possible impact of the variant (is it lying in conserved sequences? regulatory? coding?)
  • Cross-comparison with other sources of knowledge for the functional annotation (what could they do)

Assessment of Genomic Variability

It is quite easy to see how variability is assessed in a single genomic position. Let us imagine a simple case of a single nucleotide variant that we want to assess from a functional point of view. This means that we want to realize if it is: a) common or rare b) causal or not

Answering the first question is simple. We just sequence a lot of people at that particular locus and evaluate the frequency the variant.

Answering the second question is much more complex. We first need to have a strong qualitative assumption that a particular position is linked to a particular phenotype. This does not happen very often. But assuming this is the case, we can see how we would approach this question in the following figure.

Here, we compare two groups of people, one with a certain pathological phenotype (left) and one, consisting of healthy individuals (control group, right). By checking the sequence of the particular variant in each group we find 26 out of 32 patients to be heterozygous in the variant (red), while only 2 out of 32 healthy individuals have the same (C/G) genotype, the other 30 being homozygous to the common variant (C/C).

What does this mean? Is the variant likely to be associated with the condition? And if yes, how strongly? How can we assess the difference between heathy individuals and patients?

Statistics Interlude. Odds Ratios

From a statistical point of view this is a very simple case of assessing a 2X2 contingency table. Ronald Fisher allegedly solved this problem to examine a colleague’s potential to realize whether cream was added before or after tea was poured into her cup. This is why we often refer to the test as Fisher’s test. This is a test of exact statistics, meaning that it may actually calculate the exact probability of a table being like the one observed, given two groups and two conditions and 64 objects distributed randomly between them.

Let’s start by creating the table in R:

genotype<-rbind(c(6,26), c(30,2))
colnames(genotype)<-c("reference", "variant")
rownames(genotype)<-c("Patients", "Healthy")
genotype
##          reference variant
## Patients         6      26
## Healthy         30       2

and now let’s apply Fisher’s test on the table:

fisher.test(genotype)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  genotype
## p-value = 8.151e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.001596133 0.094608419
## sample estimates:
## odds ratio 
## 0.01718739

You see that the test calculates a certain quantity called “odds ratio”, for which it also calculates a 95% confidence interval. The Odds Ratio (or OR) is calculated, as its name implies, as the ratio of the two “odds” of having the variant depending on your condition.

This is shown in the table below

and for our case is equal to (a slightly adjusted) ratio of \(OR=(6/30)/(26/2)\).

Notice how the expected value of OR under randomness is equal to 1 and this is exactly the value against which the test is performed (“true odds ratio is not equal to 1”).

fisher.test(genotype)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  genotype
## p-value = 8.151e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.001596133 0.094608419
## sample estimates:
## odds ratio 
## 0.01718739

The significance of the OR is evident in the very small value (<0.1), the narrow confidence interval, which is very far from 1 and the p-value of \(<=10^{-10}\).

Odds Ratios are the way we report the tendency of a variant to appear more frequently in one phenotype group than another. They represent the association of that variant for the particular phenotype. In general OR values >> 1 or << 1 are highly indicative of strong association.

Odds Ratios are supported with exact p-values obtained through the application of the Fisher’s test. These p-values are thus the probability of observing a given contingency table (like the one above) given that there is no preference of Groups 1,2 for K1 or K2 categories.

The p-value that is linked with the OR is, as mentioned above, an exact probability calculated with the help of the hypergeometric distribution. (In the figure below, you should interpret the \(\binom{x}{y}\) notation as the number of combinations of \(y\) selected from \(x\).)

Complexity of Statistics Interlude. Multiple loci

This is the case for one locus, but in the real world, we very rarely know the associated locus beforehand. Most of the times we are in search of a locus, or multiple loci that are associated with a condition. This is the case of the Genome Wide Association Studies, (or GWAS) in which multiple loci are tested for association with a given condition.

The result is a huge number of p-values which we often sort according to their size (transformed as \(-log10(p)\)) in plots like the one below. Because of their shape, with few positions having very high -log10(p) (and thus being highly significant) that resemble a skyscraper-skyline, we call them “Manhattan plots”.

These contain thousands of SNVs, for which we calculate the ratio of frequencies between diseased/healthy samples.

Functional interpretation of Genomic Variability

The accumulation of information of variant-condition associations allows us to have a concise view of the pathological states that are associated with genome variation and -more importantly- the way with which this happens.

We already new that some diseases, such as the thalassemias or Cystic Fibrosis are associated with only one locus. We call these diseases highly penetrant, meaning that carrying the variant is almost certain to lead to the disease. On the other side of the spectrum, low-penetrance variants are very mildly associated with a condition (that is with very low OR). Penetrance is also related to the frequency with which a condition occurs in the population. A variant may be frequent if it has small penetrance, because this allows its spread in the population. [But in some cases we can have high frequency with high penetrance. Can you think why this happens?]

The relationship between the two is shown in the diagram below for some of the most well-studied genetic diseases.

Assignment

For this week’s assignment you are asked to perform a motif building and search analysis using a set of defined motif instances and custom-made R functions.

Your goal is to create a PWM with a set of instances of the GATA transcription factor, transform it into PSSM and then search it against a whole genome sequence.

  1. You may start by downloading a set of GATA binding sequences that you will find here.

  2. Once this is done, you can download the target sequence against which you will search the PWM here.

  3. In order to build the PWM you need to read the GATA sequences using this custom function and then create the matrix with the PWM function. In order to use these functions you have to upload them to R using:

source("nameoffunction.R")

This means, that you may use \(readfastafile()\) as:

source("readfastafile.R")
gataseqs<-readfastafile("gata.fa")

and then obtain the PWM with:

source("seqMotif.R")
gataPWM<-seqMotif(gataseqs, drawmotif=F)
  1. The transformation of the PWM to PSSM requires an additional random sequence collection, which you can find here. You can then create the PSSM by taking the log2(ratio) of the two PWMs (one from the GATA sequences over the one from the random sequence).

  2. In the next step you will use a PSSMSearch Function to search for the motif in the longer target sequence. Assuming you have created the PSSM into the table called \(pssm\), the list of commands below should work:

source("pssmSearch.R")
targetset<-readfastafile("test1.fa")
pssmSearch(pssm, targetseq, threshold=X)

The parameter X in the threshold tells the search function to return a different number of successful hits for the search. Threshold can take values from 0 to 1, with 1 returning the instances of the binding sites that have score equal to the maximum (100%) of the PSSM. A value of 0.8 will return values with scores 80% or more of the maximum and thus a greater number of hits.

  1. As a last step you are asked to do the following. Perform three different searches with threshold=0.5, 0.8 and 1.0 and obtain the results in a vector. Then use this vector of positions to extract the sequences from the target sequence (test1.fa). Then store these sequences in a table (you can use “write.table”) and paste them to the WebLogo WebService to obtain three different sequence logos. Compare the three and discuss their differences.